Topic 9 - Multivariate analyses - Part B: Ordination

library(vegan)
library(ade4)
library(ape)

source ('https://www.dipintothereef.com/uploads/3/7/3/5/37359245/cleanplot.pca.r')
source ('https://www.dipintothereef.com/uploads/3/7/3/5/37359245/evplot.r')
knitr::purl("Topic_8_multivariate.Rmd", documentation = F)

Background

GUSTA ME provides a guide to statistical analysis (community-focused) in Microbial Ecology. It is very useful for choosing the relevant analysis according to the type of data you are dealing with.

Multidimensional space

The aim of ordination methods is to represent the data along a reduced number of orthogonal axes, constructed in such a way that they represent, in decreasing order, the main trends of the data.

  • The trends can be interpreted visually or in association with other methods such as clustering or regression

  • Most ordination methods (except nMDS) are based on the extraction of the eigenvectors of an association matrix

Families

Two big families of ordination analyses exist, according to how they deal with the environmental matrix (if any):

  • unconstrained ordination (indirect gradient analysis; ordination not constrained by environmental factors). These are descriptive methods that describe patterns: they generate hypotheses but cannot test them.

    • uncovers the main compositional gradients structuring the community in the species data; these gradients can then be interpreted using known (estimated or measured) environmental factors

    • environmental variables can be used a posteriori, after the analysis

  • constrained ordination (direct gradient analysis; ordination axes are constrained by environmental factors). It directly tests hypotheses about the influence of environmental factors on species composition

    • relates the species composition directly to the environmental variables and extracts the variance in species composition which is directly related to these variables

    • regarding environmental factors, it offers several interesting options, such as step-wise selection of important environmental variables (excluding those that are not relevant for species composition), tests of significance of the variance explained by environmental factors, and partitioning of the variance explained by particular environmental variables

Types

In addition, based on data input, two types of ordination analyses exist.

  • raw data: based on analysis of raw sample-by-species matrices with abundance or presence/absence data. Two categories are recognized, differing by the assumed species response along the environmental gradient:

    • linear: species respond linearly along the environmental gradient, which may hold for rather homogeneous ecological data where the gradients are not too long. Short gradient.

    • unimodal: species respond unimodally along the gradient, with an optimum at a certain position along it. Closer to the reality of ecological data and more suitable for heterogeneous datasets (long gradients, many zeros, high turnover). Long gradient.

  • distances: a distance matrix is computed with similarity/dissimilarity measures, and these distances are projected into two- or higher-dimensional diagrams

Framework

The selection of a linear or unimodal ordination approach is commonly determined by a rule of thumb: perform a Detrended Correspondence Analysis (DCA, decorana) on the dataset, then check the length of the 1st DCA axis. Typically:

  • length > 4: data are heterogeneous and you should use unimodal methods

  • length < 3: data are homogeneous and you should use linear methods

There is a grey zone between 3 and 4 where both methods are okay. In addition, if your data are heterogeneous, you can still use PCA/RDA with a Hellinger transformation of the species data (tb-PCA).
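This rule of thumb can be encoded as a small helper; the function name and labels below are illustrative (not part of vegan):

```r
# Hypothetical helper: pick the ordination family from the length of
# the first DCA axis, following the rule of thumb above.
choose_ordination <- function(dca1_length) {
  if (dca1_length > 4) {
    "unimodal (CA / DCA / CCA)"
  } else if (dca1_length < 3) {
    "linear (PCA / RDA)"
  } else {
    "grey zone: either family, or tb-PCA on Hellinger-transformed data"
  }
}

choose_ordination(2.8161)  # varespec axis length reported below
choose_ordination(3.855)   # doubs fish data axis length reported below
```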

data(varespec)
decorana(varespec)
## 
## Call:
## decorana(veg = varespec) 
## 
## Detrended correspondence analysis with 26 segments.
## Rescaling of axes with 4 iterations.
## 
##                   DCA1   DCA2    DCA3    DCA4
## Eigenvalues     0.5235 0.3253 0.20010 0.19176
## Decorana values 0.5249 0.1572 0.09669 0.06075
## Axis lengths    2.8161 2.2054 1.54650 1.64864
data(doubs)
doubspec<-doubs$fish[-8,]
decorana(doubspec)
## 
## Call:
## decorana(veg = doubspec) 
## 
## Detrended correspondence analysis with 26 segments.
## Rescaling of axes with 4 iterations.
## 
##                  DCA1   DCA2    DCA3    DCA4
## Eigenvalues     0.538 0.1964 0.08759 0.17777
## Decorana values 0.601 0.1283 0.06111 0.03293
## Axis lengths    3.855 1.4672 2.39227 1.79048

Unconstrained ordinations

  • Principal Component Analysis (PCA): the main eigenvector-based method (and the most famous). Works on raw, quantitative data. Preserves the Euclidean distance among sites.

  • Correspondence Analysis (CA): works on data that must be frequencies or frequency-like, dimensionally homogeneous, and non-negative. Preserves the Chi-square distance among rows and columns. Applied in ecology to analyze species data.

  • Principal Coordinate Analysis (PCoA, also called MDS): devoted to the ordination of distance matrices, most often in Q mode. Great flexibility in the choice of association measures; common in trait-based approaches for that reason.

  • non-metric Multi-Dimensional Scaling (nMDS): unlike the three others, this is not an eigenvector-based method. nMDS tries to represent the set of objects along a predetermined number of axes (usually 2) while preserving the ordering relationships among them.

Principal Component Analysis (PCA)

A PCA carries out a rotation of the system of axes defined by the variables, such that the new axes (called principal components) are orthogonal to one another and correspond to the successive dimensions of maximum variance of the scatter of points.

A PCA will find the "best" line (first principal component) according to two criteria of what "best" means: maximize the variance and minimize the error.

Applied to a 2D example on wine tasting, this is like identifying a new property of wine as a combination of two variables (alcohol content and wine darkness) by drawing a line through the center of the wine cloud.

Scatterplot wine properties

Maximizing variance and minimizing error: the two criteria coincide, as a simple application of the Pythagorean theorem.

The direction is given by the eigenvector, and the strength by the eigenvalue.

Check it here for a very simple explanation of how PCA works, and here for a simple example, and here for a more detailed example.
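These ideas can be sketched in base R on made-up toy data (no vegan needed): the eigenvectors of the covariance matrix give the axis directions, the eigenvalues give the variance along them, and no direction carries more variance than the first eigenvector:

```r
# Toy 2-D data: two correlated variables, centred
set.seed(42)
x1 <- rnorm(100)
X  <- cbind(x1, 0.8 * x1 + 0.3 * rnorm(100))
X  <- scale(X, scale = FALSE)       # centre only, no standardization

e   <- eigen(cov(X))                # eigenvectors = directions, eigenvalues = variances
pc1 <- drop(X %*% e$vectors[, 1])   # scores on the first principal component

var(pc1)                            # equals the first eigenvalue, e$values[1]
e$values[1] / sum(e$values)         # proportion of variance carried by PC1
```

The same result is obtained with prcomp(X); the explicit eigen decomposition just makes the mechanics visible.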

Computation

In R, a PCA can be computed with the function rda from the vegan package, among many other options.

# PCA on the full varechem dataset
# arg scale = T standardizes our variables within the rda function
data(varechem)
env<-varechem
env.pca<-rda(env, scale=T) 
env.pca
## Call: rda(X = env, scale = T)
## 
##               Inertia Rank
## Total              14     
## Unconstrained      14   14
## Inertia is correlations 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10  PC11  PC12  PC13 
## 5.192 3.193 1.686 1.069 0.816 0.706 0.436 0.369 0.171 0.150 0.085 0.070 0.035 
##  PC14 
## 0.024
summary(env.pca) # default scaling 2
## 
## Call:
## rda(X = env, scale = T) 
## 
## Partitioning of correlations:
##               Inertia Proportion
## Total              14          1
## Unconstrained      14          1
## 
## Eigenvalues, and their contribution to the correlations 
## 
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Eigenvalue            5.1916 3.1928 1.6855 1.06899 0.81599 0.70581 0.43641
## Proportion Explained  0.3708 0.2281 0.1204 0.07636 0.05829 0.05042 0.03117
## Cumulative Proportion 0.3708 0.5989 0.7193 0.79564 0.85392 0.90434 0.93551
##                           PC8    PC9    PC10    PC11    PC12     PC13     PC14
## Eigenvalue            0.36884 0.1707 0.14953 0.08526 0.06986 0.035057 0.023615
## Proportion Explained  0.02635 0.0122 0.01068 0.00609 0.00499 0.002504 0.001687
## Cumulative Proportion 0.96185 0.9740 0.98473 0.99082 0.99581 0.998313 1.000000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores:  4.236078 
## 
## 
## Species scores
## 
##              PC1     PC2      PC3      PC4       PC5      PC6
## N         0.2441  0.2516 -0.23818  0.92846 -0.359554  0.30319
## P        -0.9181 -0.4194  0.13356  0.09649  0.224909  0.07938
## K        -0.9369 -0.3272 -0.06261  0.25051  0.180899 -0.28683
## Ca       -0.9280 -0.1711  0.47982 -0.02138 -0.241479 -0.09536
## Mg       -0.9087 -0.1978  0.12446 -0.04430 -0.497428 -0.10781
## S        -0.8576 -0.6002 -0.31620 -0.01652  0.071629 -0.08779
## Al        0.2980 -0.9720 -0.39009  0.09092  0.005386 -0.19467
## Fe        0.5236 -0.7748 -0.23063  0.37187 -0.023377 -0.32292
## Mn       -0.7418  0.3908  0.13114  0.44346  0.501064  0.06776
## Zn       -0.8926 -0.3277  0.06142 -0.06121 -0.153677  0.51653
## Mo       -0.1072 -0.4625 -0.85720 -0.27258 -0.030063  0.39721
## Baresoil -0.4218  0.6387 -0.40089 -0.07389 -0.416610 -0.31060
## Humdepth -0.6070  0.6825 -0.41674  0.02975 -0.047738 -0.15923
## pH        0.4286 -0.6517  0.66384  0.07599 -0.265374  0.05070
## 
## 
## Site scores (weighted sums of species scores)
## 
##         PC1      PC2       PC3      PC4       PC5      PC6
## 18  0.05126  0.84085 -0.190142 -0.40431  0.101518 -1.01997
## 15  0.20452  0.37397  0.002978 -1.06030  1.175734 -0.92257
## 24 -1.53647 -1.25830 -0.012170 -0.86690 -1.854372  1.74375
## 27 -1.25643  0.51345  0.743928  0.63835  1.097881  0.15631
## 23 -0.71203  0.86247 -0.203389 -0.05971 -0.786196 -0.80080
## 19 -0.76179  0.73785 -0.761645 -0.31728 -1.303297 -0.62264
## 22 -0.30340  0.86304  0.188484  0.54758 -0.078064  0.24567
## 16  0.62797  0.76436 -0.124301 -0.26047  0.009044 -0.02807
## 28 -1.13923  0.44909  0.229104  1.61582  0.842871  0.60852
## 13 -0.62483 -0.85407 -1.259755  1.38195 -0.157593 -1.04794
## 14 -0.04206  0.60916 -0.719743 -0.73109 -0.230147  0.25505
## 20 -0.93747  0.20753 -0.689713  0.62198 -0.282188  0.33232
## 25 -0.34451  0.90128  0.710714  0.64127  1.214897  0.48173
## 7   1.50502 -0.22114 -0.495604  0.87068 -0.769266 -0.25733
## 5   1.40889  0.60035  0.911463  0.71133 -0.488111  2.45195
## 6   1.30776  0.03604 -0.243884 -0.87792  0.543259  0.46876
## 3   1.64967 -0.82755 -0.343406  1.40028 -0.546374 -0.19580
## 4  -0.26711 -1.65198 -1.744251 -0.60763  0.947492  0.46203
## 2   0.59653 -1.21610 -0.200030  0.72815  0.799208 -0.76013
## 9   0.03271 -0.69445 -0.350028 -1.35247  0.717140  0.82965
## 12  0.47944  0.26377  0.184677 -1.27744  0.573929  0.07253
## 10 -0.32993 -0.95483  1.670469 -0.51343  0.819138 -0.48682
## 11 -0.09921 -1.52318  2.493913 -0.09044 -1.092296 -1.10654
## 21  0.49069  1.17838  0.202333 -0.73801 -1.254205 -0.85966

Inertia: in vegan’s language, this is the general term for “variation” in the data

Constrained and unconstrained: In PCA, the analysis is unconstrained, i.e. not constrained by a set of explanatory variables, and so are the results

Eigenvalues, symbolized λj: these are measures of the importance (variance) of the PCA axes. They can be expressed as Proportions Explained, or proportions of variation accounted for by the axes, by dividing each eigenvalue by the “total inertia”.
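Concretely, using the eigenvalues from the summary above (total inertia = 14 because there are 14 standardized variables):

```r
# Eigenvalues copied from the env.pca summary above
eig <- c(5.192, 3.193, 1.686, 1.069, 0.816, 0.706, 0.436,
         0.369, 0.171, 0.150, 0.085, 0.070, 0.035, 0.024)

prop <- eig / sum(eig)   # "Proportion Explained" in the summary
cumsum(prop)[1:2]        # first two axes together explain about 60%
```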

Scaling: not to be confused with the argument "scale" calling for standardization of variables. "Scaling" refers to the way ordination results are projected in the reduced space for graphical display. There is no single way of optimally displaying objects and variables together in a PCA biplot, i.e., a plot showing two types of results, here the sites and the variables. Two main types of scaling are generally used. Each of them has properties that must be kept in mind for proper interpretation of the biplots. Here we give the essential features of each scaling. Please refer to Legendre and Legendre (2012) for a complete description.

  • Scaling 1 = distance biplot: the eigenvectors are scaled to unit length. (1) Distances among objects in the biplot are approximations of their Euclidean distances in multidimensional space. (2) The angles among descriptor vectors do not reflect their correlations.

  • Scaling 2 = correlation biplot: each eigenvector is scaled to the square root of its eigenvalue. (1) Distances among objects in the biplot are not approximations of their Euclidean distances in multidimensional space. (2) The angles between descriptors in the biplot reflect their correlations.

  • The compromise Scaling 3 has no clear interpretation rules, and therefore we will not discuss it further in this book.

Bottom line: if the main interest of the analysis is to interpret the relationships among objects, choose scaling 1. If the main interest focuses on the relationships among descriptors, choose scaling 2.

# Plots using biplot
# To help memorize the meaning of the scalings, vegan now accepts argument scaling = "sites" for scaling 1 and scaling="species" for scaling 2. This is true for all vegan functions involving scalings
par(mfrow = c(1, 2))
biplot(env.pca, scaling = 1, main = "PCA - scaling 1")
biplot(env.pca, main = "PCA - scaling 2") # Default scaling 2

# Plots using cleanplot.pca
cleanplot.pca(env.pca)

dev.off()
## null device 
##           1

Species scores: coordinates of the arrowheads of the variables. For historical reasons, response variables are always called “species” in vegan, no matter what they represent, because vegan is built for vegetation analysis.

summary(env.pca)$species
##                 PC1        PC2         PC3         PC4          PC5         PC6
## N         0.2440979  0.2515733 -0.23818234  0.92845641 -0.359553527  0.30318540
## P        -0.9180524 -0.4193826  0.13356210  0.09648767  0.224909353  0.07938380
## K        -0.9368729 -0.3271677 -0.06260980  0.25051247  0.180899487 -0.28683242
## Ca       -0.9280154 -0.1710871  0.47981561 -0.02137797 -0.241479006 -0.09536341
## Mg       -0.9087391 -0.1977676  0.12445939 -0.04429598 -0.497428380 -0.10781462
## S        -0.8576285 -0.6001923 -0.31619893 -0.01651531  0.071628806 -0.08778905
## Al        0.2979981 -0.9719998 -0.39009263  0.09092398  0.005385911 -0.19467254
## Fe        0.5236461 -0.7748245 -0.23063499  0.37187022 -0.023377408 -0.32291739
## Mn       -0.7417951  0.3908124  0.13113707  0.44346463  0.501064410  0.06775658
## Zn       -0.8926155 -0.3276820  0.06141777 -0.06121484 -0.153676876  0.51653200
## Mo       -0.1071693 -0.4625158 -0.85719968 -0.27258226 -0.030062540  0.39721211
## Baresoil -0.4217537  0.6386581 -0.40088872 -0.07388829 -0.416609754 -0.31059907
## Humdepth -0.6070224  0.6824708 -0.41673807  0.02975285 -0.047737672 -0.15923057
## pH        0.4285669 -0.6517311  0.66384419  0.07599465 -0.265373616  0.05070435

Site scores: coordinates of the sites in the ordination diagram. Objects are always called “Sites” in vegan output files.

summary(env.pca)$site
##            PC1         PC2          PC3         PC4          PC5         PC6
## 18  0.05125774  0.84085141 -0.190142229 -0.40430510  0.101517981 -1.01996800
## 15  0.20451664  0.37396845  0.002978186 -1.06030207  1.175734141 -0.92256629
## 24 -1.53646532 -1.25829807 -0.012169567 -0.86689800 -1.854371921  1.74375441
## 27 -1.25642818  0.51344912  0.743928187  0.63834616  1.097880596  0.15630775
## 23 -0.71203013  0.86246533 -0.203389168 -0.05971047 -0.786196338 -0.80080062
## 19 -0.76179163  0.73785383 -0.761645009 -0.31728175 -1.303296913 -0.62264463
## 22 -0.30339881  0.86303581  0.188484225  0.54758062 -0.078063726  0.24567029
## 16  0.62797317  0.76435889 -0.124301252 -0.26047338  0.009043821 -0.02806617
## 28 -1.13922500  0.44909361  0.229104347  1.61582083  0.842870639  0.60852025
## 13 -0.62483096 -0.85407334 -1.259754975  1.38195362 -0.157592594 -1.04794437
## 14 -0.04205762  0.60916369 -0.719742729 -0.73108738 -0.230147385  0.25504547
## 20 -0.93747058  0.20753216 -0.689713110  0.62197912 -0.282187780  0.33232127
## 25 -0.34450969  0.90128283  0.710713830  0.64127030  1.214896913  0.48172601
## 7   1.50502072 -0.22114340 -0.495603913  0.87068106 -0.769266148 -0.25732850
## 5   1.40888836  0.60035498  0.911462530  0.71133424 -0.488110924  2.45194953
## 6   1.30776410  0.03603859 -0.243884429 -0.87792010  0.543258504  0.46875839
## 3   1.64966903 -0.82755080 -0.343406202  1.40028095 -0.546373879 -0.19580193
## 4  -0.26710838 -1.65197820 -1.744251364 -0.60762599  0.947491698  0.46203382
## 2   0.59652585 -1.21609676 -0.200030393  0.72814752  0.799207666 -0.76013198
## 9   0.03271338 -0.69445160 -0.350028314 -1.35246903  0.717139959  0.82965192
## 12  0.47943679  0.26377399  0.184676644 -1.27743600  0.573929142  0.07253206
## 10 -0.32992544 -0.95482574  1.670469377 -0.51343047  0.819137833 -0.48681948
## 11 -0.09921276 -1.52318433  2.493912634 -0.09044412 -1.092296157 -1.10653970
## 21  0.49068874  1.17837957  0.202332694 -0.73801055 -1.254205128 -0.85965950

Eigenvalues

Examining eigenvalues is important to decide how many axes to retain. Usually, the user examines the eigenvalues and decides how many axes are worth representing and displaying on the basis of the amount of variance explained. The decision can be completely arbitrary (for instance, interpret the number of axes necessary to represent 75% of the variance of the data), or assisted by one of several procedures proposed to set a limit between the axes that represent interesting variation of the data and axes that merely display the remaining, essentially random variance.

One of these procedures consists in computing a broken stick model, which randomly divides a stick of unit length into the same number of pieces as there are PCA eigenvalues. One interprets only the axes whose eigenvalues are larger than the length of the corresponding piece of the stick, or, alternatively, one may compare the sum of eigenvalues, from 1 to k, to the sum of the values from 1 to k predicted by the broken stick model.
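The expected broken-stick proportions can be computed directly in base R (this mirrors what vegan's bstick does, up to scaling by the total inertia):

```r
# Expected proportion of variance for piece k out of p under the
# broken stick model: b_k = (1/p) * sum_{j = k..p} 1/j
bstick_prop <- function(p) sapply(seq_len(p), function(k) sum(1 / (k:p)) / p)

bs <- bstick_prop(14)   # 14 axes, as in env.pca
round(bs[1:3], 3)
# Interpret axis k only if its observed proportion of variance exceeds bs[k]
```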

screeplot(env.pca, bstick = TRUE, npcs = length(env.pca$CA$eig))

Interpretation

The proportion of variance accounted for by the first two axes is about 60%; this relatively high value makes us confident that our interpretation of the first pair of axes extracts the most relevant information from the data.

par(mfrow = c(1, 2))
cleanplot.pca(env.pca)

dev.off()
## null device 
##           1

Scaling 1: gradient from left to right, with a group of sites displaying higher values of Ca, Mg, K, Zn, S. Gradient from top to bottom with Humdepth and Baresoil.

Scaling 2: Humdepth, Baresoil are very highly positively correlated. Those two variables are very highly negatively correlated with pH and Fe.

Adding cluster

# combining clustering and ordination results
biplot(env.pca, main='PCA - scaling 1',scaling=1) 
ordicluster(env.pca, 
            hclust(dist(scale(env)), 'ward.D'), 
            prune=3, col = "blue", scaling=1)

PCA on transformed species data (tb-PCA)

Since PCA is a linear method preserving the Euclidean distance among sites, it is not naturally adapted to the analysis of species abundance data. A Hellinger pre-transformation of the data can solve this problem.
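The Hellinger transformation itself is simple: take the square root of the relative abundances per site. A base-R sketch on a toy matrix (decostand(..., 'hellinger') used below does the same thing):

```r
# Two sites x three species (toy counts)
comm <- matrix(c(10, 0, 5,
                  2, 8, 0), nrow = 2, byrow = TRUE)

hel <- sqrt(comm / rowSums(comm))   # Hellinger transformation by hand

# Each transformed site vector now has unit squared length, so the
# Euclidean distance between rows of hel is the Hellinger distance
rowSums(hel^2)
```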

# Hellinger pre-transformation of the species data
data(varespec)
spe<-varespec
spe.h<-decostand(spe,'hellinger')

# DCA + RDA
decorana (spe.h)
## 
## Call:
## decorana(veg = spe.h) 
## 
## Detrended correspondence analysis with 26 segments.
## Rescaling of axes with 4 iterations.
## 
##                   DCA1   DCA2    DCA3    DCA4
## Eigenvalues     0.2460 0.1503 0.09661 0.06645
## Decorana values 0.2538 0.1232 0.06096 0.02942
## Axis lengths    2.0513 1.6141 1.28896 0.98323
spe.h.pca<-rda(spe.h)
screeplot(spe.h.pca,bstick = TRUE, npcs = length(spe.h.pca$CA$eig))

# plot PCA
cleanplot.pca (spe.h.pca)

Note: It is possible to get a passive (post hoc) explanation of axes using environmental variables. Although there are means of incorporating explanatory variables directly in the ordination process (canonical ordination), one may be interested in interpreting a simple ordination by means of external variables.

#A posteriori projection of environmental variables in a PCA
# A PCA scaling 2 plot is produced in a new graphic window.
biplot(spe.h.pca, main = "PCA species abundances-scaling 2")
# Scaling 2 is default
(spe.h.pca.env <- envfit(spe.h.pca, env, scaling = 2))
## 
## ***VECTORS
## 
##               PC1      PC2     r2 Pr(>r)    
## N         0.42906  0.90328 0.1921  0.129    
## P         0.41543 -0.90962 0.2817  0.031 *  
## K         0.60351 -0.79736 0.2128  0.081 .  
## Ca        0.59845 -0.80116 0.3463  0.017 *  
## Mg        0.62282 -0.78237 0.2600  0.051 .  
## S         0.09264 -0.99570 0.1182  0.250    
## Al       -0.96386  0.26642 0.4964  0.001 ***
## Fe       -0.94350  0.33136 0.4364  0.004 ** 
## Mn        0.87295 -0.48782 0.4638  0.001 ***
## Zn        0.65563 -0.75508 0.1872  0.119    
## Mo       -0.64250  0.76629 0.0498  0.569    
## Baresoil  0.99472  0.10260 0.2582  0.047 *  
## Humdepth  0.83492 -0.55037 0.4671  0.002 ** 
## pH       -0.99555  0.09422 0.2569  0.030 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
# Plot significant variables with a user -selected colour
plot(spe.h.pca.env, p.max = 0.05, col = 3)

# This has added the significant environmental variables to the
# last biplot drawn by R.
# BEWARE: envfit() must be given the same scaling as the plot to 
# which its result is added!

envfit proposes permutation tests to assess the significance of the R2 of each explanatory variable regressed on the two axes of the biplot. But this is not, by far, the best way to test the effect of explanatory variables on a table of response variables.

More on PCA

Conditions of applications:

  • PCA must be computed on dimensionally homogeneous variables

  • the data matrix is not transposed, since covariances or correlations among objects are meaningless

  • PCA can be applied to binary data

  • Species presence-absence data can be subjected to a Hellinger or chord transformation prior to PCA

  • Interpretation of relationships among variables is based on the angles between them

RP19: Using the environmental dataset carp.chemistry from a Carpathian wetland (mainly chemistry):

chem<-read.table('https://www.dipintothereef.com/uploads/3/7/3/5/37359245/carp.chemistry.txt', header=T, sep=",", row.names=1)

  • remove the variable slope, which is not a chemical variable

  • perform a PCA, and evaluate importance of ordination axes using the broken stick criterion

  • make biplots using scaling = 1 and scaling = 2, and interpret the results

  • Repeat the same for the associated species data

chem<-read.table ('https://www.dipintothereef.com/uploads/3/7/3/5/37359245/carp.chemistry.txt',header=T, sep=",",row.names=1) 

You can play a bit by adding cluster, post-hoc interpretation using environmental variables, etc.

#environmental data and RDA
chem$slope<-NULL
stand.chem <- scale (chem)
PCA1 <- rda (stand.chem) 

# broken stick
screeplot(PCA1,bstick = TRUE, npcs = length(PCA1$CA$eig))

# Kaiser-Guttman criterion (eigenvalues above the average eigenvalue)
ev<-PCA1$CA$eig
barplot(ev, main='Eigenvalues',col='bisque',las=2)
abline(h=mean(ev),col='red') # average eigenvalue
legend('topright','Average eigenvalue',lwd=1,col=2,bty='n')

#biplot
cleanplot.pca(PCA1)

Correspondence Analysis (CA)

  • Data submitted to CA must be frequencies or frequency-like, dimensionally homogeneous and non-negative; that is the case of species counts or presence-absence data.

  • Accordingly, for a long time, CA has been one of the favorite tools for the analysis of species presence-absence or abundance data

  • The raw data are first transformed into a matrix Q of cell-by-cell contributions to the Pearson Chi-square statistic, and the resulting table is submitted to a singular value decomposition to compute its eigenvalues and eigenvectors

  • The result is an ordination, where it is the Chi-square distance (D16) that is preserved among sites instead of the Euclidean distance D1.

  • Chi-square distance is not influenced by the double zeros. Therefore, CA is a method adapted to the analysis of species abundance data without pre-transformation.
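The steps above can be sketched in base R on a toy frequency table (the scaling of scores in vegan's cca differs, but the eigenvalues come from exactly this decomposition):

```r
# Toy sites x species frequency table
f <- matrix(c(10, 5, 0,
               2, 8, 4,
               0, 3, 9), nrow = 3, byrow = TRUE)

n  <- sum(f)
p  <- f / n                          # relative frequencies
rw <- rowSums(p); cw <- colSums(p)   # row and column weights

# Matrix Q of cell-by-cell contributions to the Pearson Chi-square statistic
Q <- (p - outer(rw, cw)) / sqrt(outer(rw, cw))

sv  <- svd(Q)      # singular value decomposition
eig <- sv$d^2      # CA eigenvalues
sum(eig)           # total inertia = Chi-square statistic / n
```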

Graphical representation

Accordingly, in a CA, both objects and species are represented by points in the ordination diagram (compared to PCA where species/descriptors are vectors and sites are points).

Note 1: the Chi-square distance gives high weight to rare species, so it is usually considered one of the least suitable distance measures for ecological data.

Note 2: CA suffers from an artefact called the arch effect, which is caused by non-linear correlations between the first and higher axes. A popular, though clumsy, way to remove this artefact is Detrended Correspondence Analysis (DCA).

Similarly to PCA, two types of scaling are possible:

  • scaling 1: the distances among objects (sites) in the reduced ordination space approximate their chi-square distance; any object found near the point representing a species is likely to contain a high contribution of that species.

  • scaling 2: the distances among descriptors (species) in the reduced space approximate their chi-square distances; any species that lies close to the point representing an object is more likely to be found in that object or to have higher frequency there.

Both the broken stick model and the Kaiser-Guttman criterion can be applied for guidance on which CA axes to retain.

Computation

Correspondence analysis can be computed using the function cca (library vegan). If no environmental matrix is specified, cca calculates an unconstrained correspondence analysis.

You can apply the functions screeplot or evplot (Borcard et al. 2011), in order to select important ordination axes based on Kaiser-Guttman or broken stick model.

spe.ca<-cca(spe) #default summary for "scaling = 2" 
screeplot(spe.ca, bstick = TRUE, npcs = length(spe.ca$CA$eig))

ev2<-spe.ca$CA$eig # extract eigenvalues
evplot(ev2)

It is time to draw the CA biplots of this analysis. Let us compare the two scalings.

# CA biplots
par(mfrow=c(1,2))
# Scaling 1: sites are centroids of species
plot(spe.ca,scaling=1,main='CA - biplot scaling 1')
# Scaling 2: species are centroids of sites
plot(spe.ca,main='CA - biplot scaling 2')

Another option using the function ordiplot

# CA biplots
par(mfrow=c(1,2))
# Scaling 1: sites are centroids of species
ordiplot(spe.ca,scaling=1,main='CA - biplot scaling 1')
# Scaling 2: species are centroids of sites
ordiplot(spe.ca,main='CA - biplot scaling 2')

dev.off()
## null device 
##           1

Passive (Post Hoc) explanation of axes using environmental parameters

Here again you can use envfit from the vegan package: it fits vectors or factor averages of environmental variables. The projections of points onto vectors have maximum correlation with the corresponding environmental variables, and the factors show the averages of factor levels.

# A posteriori projection of environmental variables in a CA
# The last plot produced (CA scaling 2) must be active
plot(spe.ca,main='CA- biplot scaling 2')
spe.ca.env <-envfit(spe.ca,env)
plot(spe.ca.env)

# It added the environment variables to the last biplot drawn
plot(spe.ca, main = "CA - scaling 2",
sub = "Fitted curves: humdepth (red), Baresoil (green)")
spe.ca.env <- envfit(spe.ca ~ Humdepth + Baresoil, env)
plot(spe.ca.env) # Two arrows
ordisurf(spe.ca, env$Humdepth, add = TRUE)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 1.77  total = 2.77 
## 
## REML score: 21.13899
ordisurf(spe.ca, env$Baresoil, add = TRUE, col = "green")

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 5.42  total = 6.42 
## 
## REML score: 93.07306

Arch effect and Detrended Correspondence Analysis (DCA)

Long environmental gradients often support a succession of species. Since species that are controlled by environmental factors tend to have unimodal distributions, a long gradient may encompass sites that, at both ends of the gradient, have no species in common: their distance reaches its maximum value (or their similarity is 0). But on either side of the succession, contiguous sites continue to grow more different from each other. Therefore, instead of a linear trend as in PCA, the gradient is represented on the first pair of CA axes as an arch.

Detrending is the process of removing the arch effect: DCA does it by dividing the first axis into segments (or fitting a polynomial relationship), and then centering the second-axis scores on zero within each segment. Watch here

Detrended Correspondence Analysis (DCA) is often criticized and not recommended. However, DCA is still one of the most widely used unconstrained ordination methods among vegetation ecologists (zoologists are biased toward nMDS).

doubs.dca<-decorana(doubspec)
plot(doubs.dca)

RP20: Using Ellenberg’s Danube meadow dataset (data mveg, package dave):

  • Compare the results of CA and DCA

  • Try (more challenging) to combine the results of both CA and DCA in the same ordination plot. Results should look similar to this:

(1) You may need functions cca, decorana, ordiplot, scores and text.

(2) First calculate both CA and DCA on Danube data and draw CA ordination scatterplot (to draw only sites, in ordiplot use argument display = 'sites').

(3) To add sites from DCA, you need to shift their scores along the second (vertical) axis, otherwise they will be clustered together with CA samples. Add a constant (e.g. 2) to the scores along the second axis. To extract scores, use the function scores on the DCA object with argument display = 'sites'.

(4) To add sites from DCA into CA ordination plot, use low-level graphical function text on matrix of scores from DCA, with corrected second axis.

(5) To avoid overlap of labels in text function, employ also the argument labels, which should contain values from rownames of DCA scores.

Principal Coordinate Analysis (PCoA)

Also called Multi-Dimensional Scaling (MDS)

  • PCA and CA both impose the preservation of a distance among objects: the Euclidean distance and the chi-squared distance, respectively.

  • If you want to ordinate objects on the basis of another distance measure, then PCoA is the method of choice

  • PCoA provides a Euclidean representation of a set of objects whose relationships are measured by any similarity or distance measure chosen by the user. PCoA should be reserved for situations where the measure to be preserved is not Euclidean.

  • Like PCA and CA, PCoA produces a set of orthogonal axes whose importance is measured by eigenvalues. Since it is based on an association matrix, it can directly represent the relationships either among objects (Q mode matrix) or among variables (R mode matrix).

  • If the association coefficient is non-Euclidean, PCoA may produce several negative eigenvalues in addition to the positive ones; this can be remedied by adding a constant (Lingoes or Cailliez correction)

  • Variables (e.g. species) can also be projected onto a PCoA a posteriori
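The negative-eigenvalue problem and the additive-constant fix can be demonstrated with base R's cmdscale, whose add = TRUE argument applies the Cailliez correction. The toy dissimilarities below are made up so as to violate the triangle inequality, hence non-Euclidean:

```r
# Dissimilarities violating the triangle inequality: d(1,3) > d(1,2) + d(2,3)
m <- matrix(c( 0, 1, 10,
               1, 0,  1,
              10, 1,  0), nrow = 3)
d <- as.dist(m)

res <- cmdscale(d, k = 1, eig = TRUE)
res$eig      # one eigenvalue is negative: the input is non-Euclidean

res.c <- cmdscale(d, k = 1, eig = TRUE, add = TRUE)
res.c$ac     # Cailliez additive constant applied to the off-diagonal dissimilarities
```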

Note: Computing Euclidean distance among sites and running a PCoA yields the exact same results as running a PCA of the same data and looking at the scaling 1 ordination results.
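This note can be checked directly in base R on random toy data: prcomp plays the role of a PCA (scaling 1 site scores), and cmdscale is a PCoA:

```r
set.seed(1)
X <- matrix(rnorm(30), nrow = 10)             # 10 objects x 3 variables

pca  <- prcomp(X)                             # PCA (centred, not scaled)
pcoa <- cmdscale(dist(X), k = 2, eig = TRUE)  # PCoA on Euclidean distances

# The two ordinations give the same axes, up to a sign flip
abs(diag(cor(pca$x[, 1:2], pcoa$points)))     # both essentially 1
```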

Computation

  • cmdscale (package stats, loaded by default) – calculates a PCoA of the distances among samples (these can be computed e.g. with the function vegdist). Use the function ordiplot to plot the ordination.

  • pcoa (library ape) – another way to perform a PCoA. Use the biplot.pcoa function to plot the ordination.

The ordination axes of a PCoA can be interpreted like those of a PCA or CA: proximity of objects represents similarity in the sense of the association measure used.

# PCoA on a Bray-Curtis dissimilarity matrix of the species data
spe.bray<-vegdist(spe)
spe.b.pcoa<-cmdscale(spe.bray, eig=TRUE, add=T)
# Plot of the sites and weighted average projection of species
ordiplot(spe.b.pcoa, type='t', main='PCoA with species')
abline(h=0,lty=3)
abline(v=0, lty=3)
# add species (weighted average species abundance)
spe.wa<-wascores(spe.b.pcoa$points[,1:2],spe)
text (spe.wa,rownames(spe.wa),cex=0.7,col='red')

doubs data with pcoa

doubspec.bray<-vegdist(doubspec)
doubspec.bray.pcoa<-pcoa(doubspec.bray)
biplot.pcoa(doubspec.bray.pcoa,doubspec)
abline(h=0,lty=3)
abline(v=0,lty=3)

non-metric Multidimensional Scaling (nMDS)

  • If the priority is not to preserve the exact distances among objects in an ordination diagram, but rather to represent as well as possible the ordering relationships among objects in a small, specified number of axes, nMDS is the method of choice

  • Like PCoA, nMDS can produce ordinations of objects from any distance matrix

  • The method can cope with missing distances, as long as enough measures remain to position each object with respect to the others.

This is NOT an eigenvalue-based ordination method, and it does not maximize the variability associated with individual axes of the ordination.

Procedure

Very simple:

  • specify the number m of axes (dimensions) – usually 2 !

  • construct an initial configuration of the objects in the m dimensions, to be used as the starting point of an iterative adjustment process. This is a tricky step, since the end result may depend on the starting configuration.

  • an iterative procedure tries to position the objects in the requested number of dimensions in such a way as to minimize a stress function (scaled from 0 to 1), which measures how far the distances in the reduced-space configuration are from being monotonic to the original distances in the association matrix

  • the adjustment goes on until the stress value can no longer be lowered, or until it reaches a predetermined value (which is sometimes never reached)

Computation

Functions:

  • metaMDS (library vegan) – advanced function, composed of many sub-routine steps. Species points are added to the ordination plot using wascores

  • isoMDS (library MASS) can be used when there are missing values in the distance matrix; see also bestnmds (library labdsv)
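A minimal sketch of the isoMDS route (assuming the spe table from above; isoMDS returns a list with $points and $stress, the latter expressed as a percentage):

```r
library(MASS)
# single nMDS run, started from a classical-scaling (cmdscale) configuration
spe.iso <- isoMDS(vegdist(spe), k = 2)   # k = number of dimensions
plot(spe.iso$points, type = 'n', xlab = 'NMDS1', ylab = 'NMDS2',
     main = paste('isoMDS - stress =', round(spe.iso$stress, 1)))
text(spe.iso$points, labels = rownames(spe.iso$points), cex = 0.7)
```

Unlike metaMDS, isoMDS applies no transformation, performs no multiple random starts and does not rotate or rescale the solution.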

spe.nmds<-metaMDS(spe,distance='bray',trymax=999)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1843196 
## Run 1 stress 0.20883 
## Run 2 stress 0.1852397 
## Run 3 stress 0.195049 
## Run 4 stress 0.2109011 
## Run 5 stress 0.2021169 
## Run 6 stress 0.18458 
## ... Procrustes: rmse 0.04939534  max resid 0.1577226 
## Run 7 stress 0.1955842 
## Run 8 stress 0.196246 
## Run 9 stress 0.2061123 
## Run 10 stress 0.1869637 
## Run 11 stress 0.1869637 
## Run 12 stress 0.195049 
## Run 13 stress 0.1825658 
## ... New best solution
## ... Procrustes: rmse 0.04167693  max resid 0.1520288 
## Run 14 stress 0.1955859 
## Run 15 stress 0.2137686 
## Run 16 stress 0.2263604 
## Run 17 stress 0.2178675 
## Run 18 stress 0.18458 
## Run 19 stress 0.208595 
## Run 20 stress 0.1948414 
## Run 21 stress 0.2085516 
## Run 22 stress 0.217136 
## Run 23 stress 0.1843196 
## Run 24 stress 0.1843196 
## Run 25 stress 0.2082887 
## Run 26 stress 0.2093113 
## Run 27 stress 0.2382124 
## Run 28 stress 0.2325124 
## Run 29 stress 0.2538721 
## Run 30 stress 0.195049 
## Run 31 stress 0.264557 
## Run 32 stress 0.2048307 
## Run 33 stress 0.2300995 
## Run 34 stress 0.1852397 
## Run 35 stress 0.2315384 
## Run 36 stress 0.220784 
## Run 37 stress 0.1950491 
## Run 38 stress 0.2028828 
## Run 39 stress 0.2085949 
## Run 40 stress 0.1969805 
## Run 41 stress 0.1948413 
## Run 42 stress 0.2695291 
## Run 43 stress 0.2095882 
## Run 44 stress 0.195049 
## Run 45 stress 0.2360886 
## Run 46 stress 0.1825658 
## ... New best solution
## ... Procrustes: rmse 2.889603e-05  max resid 8.937788e-05 
## ... Similar to previous best
## *** Solution reached
spe.nmds
## 
## Call:
## metaMDS(comm = spe, distance = "bray", trymax = 999) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     wisconsin(sqrt(spe)) 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.1825658 
## Stress type 1, weak ties
## Two convergent solutions found after 46 tries
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'wisconsin(sqrt(spe))'
spe.nmds$stress
## [1] 0.1825658
plot(spe.nmds,type='t',main=paste('NMDS/Bray–Stress =',round(spe.nmds$stress,3)))

Quality and stress

  • A useful way to assess the appropriateness of an nMDS is to compare, in a Shepard diagram, the distance among objects in the ordination with the original distances.

  • stressplot (library vegan) – draws the Shepard plot, i.e. the relationship between the distances among samples in the resulting m-dimensional ordination and their original dissimilarities.

stressplot(spe.nmds, main='Shepard plot')

  • In addition, the goodness-of-fit of the ordination is measured as the R2 of either a linear or a non-linear regression of the NMDS distances on the original ones
# goodness of fit
gof<-goodness(spe.nmds)
plot(spe.nmds,type='t',main='Goodness of fit')
points(spe.nmds, display='sites', cex=gof*90)

Adding cluster

# Add colours from a clustering results to an NMDS plot
# Ward clustering of Bray-Curtis dissimilarity matrix
spe.bray.ward <- hclust(spe.bray,'ward.D')
spe.bw.groups <- cutree(spe.bray.ward,k=4)
grp.lev <- levels(factor(spe.bw.groups))

# combination with NMDS result 
sit.sc <- scores(spe.nmds)
p <- ordiplot (sit.sc, type='n', main='NMDS/BRAY – clusters Ward/Bray')
for (i in 1:length(grp.lev)) {
    points(sit.sc[spe.bw.groups==i,], pch=14+i, cex=2, col=i+1)
    }
text(sit.sc,row.names(spe),pos=4,cex=0.7)
#add the dendrogram
ordicluster(p,spe.bray.ward,col='dark grey')

#
# legend(locator(1), paste('Group',c(1:length(grp.lev))),pch=14+c(1:length(grp.lev)), col=1+c(1:length(grp.lev)),pt.cex=2)
#
# using locator, you need to click on the plot where you want the legend to go
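A non-interactive alternative, if you prefer a reproducible placement (a sketch using a keyword position instead of a click):

```r
# same legend, anchored to the top-right corner instead of locator()
legend('topright', paste('Group', 1:length(grp.lev)),
       pch = 14 + 1:length(grp.lev),
       col = 1 + 1:length(grp.lev), pt.cex = 2)
```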

ordihull & ordispider

data(dune)
data(dune.env)
attach(dune.env)
NMDS.dune<-metaMDS(dune,distance='bray')
## Run 0 stress 0.1192678 
## Run 1 stress 0.1808913 
## Run 2 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 0.02026817  max resid 0.06494689 
## Run 3 stress 0.1812938 
## Run 4 stress 0.1183186 
## ... Procrustes: rmse 9.834241e-06  max resid 3.48835e-05 
## ... Similar to previous best
## Run 5 stress 0.1183186 
## ... Procrustes: rmse 2.995406e-05  max resid 9.342105e-05 
## ... Similar to previous best
## Run 6 stress 0.1192678 
## Run 7 stress 0.2915904 
## Run 8 stress 0.1192678 
## Run 9 stress 0.119269 
## Run 10 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 1.164817e-05  max resid 2.959214e-05 
## ... Similar to previous best
## Run 11 stress 0.1992853 
## Run 12 stress 0.1183186 
## ... Procrustes: rmse 6.64598e-05  max resid 0.0001912364 
## ... Similar to previous best
## Run 13 stress 0.1192678 
## Run 14 stress 0.2377443 
## Run 15 stress 0.1192683 
## Run 16 stress 0.1192678 
## Run 17 stress 0.1939205 
## Run 18 stress 0.1809581 
## Run 19 stress 0.1183186 
## ... Procrustes: rmse 8.43254e-05  max resid 0.0002770348 
## ... Similar to previous best
## Run 20 stress 0.1183186 
## ... Procrustes: rmse 4.895635e-05  max resid 0.0001382342 
## ... Similar to previous best
## *** Solution reached
plot(NMDS.dune,type='t',main=paste('NMDS/Bray – Stress =',round(NMDS.dune$stress,3)))
pl<-ordihull(NMDS.dune, Management, scaling = 3, draw='polygon',col='grey')
ordispider(pl, col="red", lty=3, label = TRUE)

# ?anosim
# ?adonis

ordisurf

Fits a smooth surface for a given variable (via a Generalized Additive Model) and plots the result on the ordination diagram

data(varespec)
data(varechem)
vare.dist <- vegdist(varespec)
vare.mds <- metaMDS(vare.dist)
## Run 0 stress 0.1000211 
## Run 1 stress 0.1000212 
## ... Procrustes: rmse 0.0001275038  max resid 0.0005458601 
## ... Similar to previous best
## Run 2 stress 0.1000211 
## ... New best solution
## ... Procrustes: rmse 3.189608e-05  max resid 0.0001383784 
## ... Similar to previous best
## Run 3 stress 0.1607502 
## Run 4 stress 0.1000211 
## ... Procrustes: rmse 3.220108e-05  max resid 0.0001401035 
## ... Similar to previous best
## Run 5 stress 0.1000211 
## ... Procrustes: rmse 3.21468e-05  max resid 0.0001395246 
## ... Similar to previous best
## Run 6 stress 0.1000211 
## ... New best solution
## ... Procrustes: rmse 2.761299e-05  max resid 0.0001193893 
## ... Similar to previous best
## Run 7 stress 0.1000211 
## ... Procrustes: rmse 9.415901e-06  max resid 3.069446e-05 
## ... Similar to previous best
## Run 8 stress 0.1000211 
## ... Procrustes: rmse 3.012027e-05  max resid 0.0001052599 
## ... Similar to previous best
## Run 9 stress 0.1000212 
## ... Procrustes: rmse 9.067517e-05  max resid 0.0003718376 
## ... Similar to previous best
## Run 10 stress 0.1000211 
## ... Procrustes: rmse 1.981557e-05  max resid 8.566501e-05 
## ... Similar to previous best
## Run 11 stress 0.2212131 
## Run 12 stress 0.2589642 
## Run 13 stress 0.1926906 
## Run 14 stress 0.1000211 
## ... Procrustes: rmse 1.687417e-05  max resid 7.141313e-05 
## ... Similar to previous best
## Run 15 stress 0.1000211 
## ... Procrustes: rmse 2.037522e-06  max resid 8.599213e-06 
## ... Similar to previous best
## Run 16 stress 0.1000211 
## ... Procrustes: rmse 6.563137e-05  max resid 0.0002844984 
## ... Similar to previous best
## Run 17 stress 0.1616364 
## Run 18 stress 0.1000211 
## ... New best solution
## ... Procrustes: rmse 3.857991e-06  max resid 1.49355e-05 
## ... Similar to previous best
## Run 19 stress 0.1000211 
## ... Procrustes: rmse 1.623038e-05  max resid 7.009533e-05 
## ... Similar to previous best
## Run 20 stress 0.1616352 
## *** Solution reached
ordisurf(vare.mds ~ Baresoil, varechem, bubble = 5)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 5.17  total = 6.17 
## 
## REML score: 93.32737

RP21: Use the Tikus dataset (library mvabund) and load the coral reef data - selected samples from years 1981, 1983 and 1985 (you need both the species data in tikus.spe and the environmental data in tikus.env, which contains the year of sampling)

  • From the species data, calculate two distance matrices: one using the Bray-Curtis index, and the second using Euclidean distances (the data should be log-transformed)

  • Use each matrix to calculate an nMDS. Project each nMDS into a 2D ordination diagram (you will have two ordination diagrams: one for the nMDS on Bray-Curtis dissimilarities, one for the nMDS on Euclidean distances). In these diagrams, use different symbols for samples from different years. Also draw a legend in the upper right corner of each diagram. Compare species composition among years using adonis (?adonis).

Constrained ordinations

Also called direct gradient analysis or canonical analysis

  • Exploratory analyses may reveal the existence of clusters or groups of objects in a data set. When a supplementary table or matrix of environmental variables is available for those objects, it is then possible to examine whether the observed patterns are related to environmental gradients

  • Objectives may be to reveal the link between community structure and habitat heterogeneity, between community structure and spatial distance, or to identify the main variables affecting communities when a large set of env. variables has been collected

Are bird communities related to environmental variables?

In contrast to unconstrained ordination, where the effect of environmental variables can only be interpreted a posteriori, canonical ordination associates two (or more) data sets in the ordination process itself. It formally tests statistical hypotheses about the significance of these relationships

Indirect vs direct interpretation

  • In constrained (canonical) ordination analyses, only the variation in the species table that can be explained by the environmental variables is displayed and analyzed, and not all the variation in the species table

  • Gradients are supposed to be known and represented by the measured variables or their combinations, while species abundance or occurrence is considered to be a response to those gradients

  • The aim is to find the best mathematical relationship between species composition and the measured environmental gradients

  • Constrained ordinations are mostly based on multivariate linear models relating principal axes to the observed environmental variables; the different techniques depend on the data types (matrix or table) and on the hypothesis underlying species distributions along the gradients (i.e. linear or unimodal)

Linear vs. unimodal

RDA has a wide application

Redundancy Analysis (RDA)

  • RDA is a method combining regression and principal component analysis (direct extension of regression analysis to model multivariate response data)

  • RDA is an extremely powerful tool in the hands of ecologists, especially since the introduction of the Legendre and Gallagher (2001) transformations that open RDA to the analysis of community composition data tables (transformation-based RDA tb-RDA)

  • RDA = multivariate (meaning multiresponse) multiple linear regression followed by a PCA of the table of fitted values. The method seeks, in successive order, a series of linear combinations of the explanatory variables that best explain the variation of the response matrix.

In RDA, the number of canonical axes corresponds to the number of explanatory variables, or more precisely to the number of degrees of freedom. Each canonical axis corresponds to a linear combination (i.e. a multiple regression model) of all explanatory variables

Computation

In R, the vegan functions:

  • rda performs an RDA if a matrix of environmental variables is supplied (if not, it calculates a PCA)

  • RsquareAdj extracts the R2 and adjusted R2 from the results of an ordination (and also of a regression)

vegan allows the computation of an RDA in two different ways.

  • Matrix syntax

The simplest syntax is to list the names of the objects (matrices) involved separated by commas such as:

RDA <- rda (Y, X, W) 

where Y is the response matrix (species composition), X is the explanatory matrix (environmental factors) and W is the matrix of covariables.

This is the simplest way, but it does not allow factors (qualitative variables) to be included in the explanatory and covariable matrices.

  • Formula syntax
RDA <- rda (Y ~ var1 + factorA + var2*var3 + Condition (var4), data = XW)

In this example, the explanatory variables are a quantitative variable var1, a categorical variable factorA, and an interaction term between var2 and var3, whereas var4 is used as a covariable whose effect is partialled out.

Examples (doubs data)

Data preparation
# import the data
data (doubs)
spe <- doubs$fish
env <- doubs$env
spa <- doubs$xy 
# remove empty site 8
spe<-spe[-8,]
env<-env[-8,]
# set aside the variable 'dfs' (distance from the source) for later use
dfs<-env[,1]
#remove the 'dfs' variable from the env dataset
env<-env[,-1]
#recode the slope variable (slo) into a factor (qualitative) variable (to show how these are handled in the ordinations)
slo2<-rep('very_steep',nrow(env))
slo2[env$slo<=quantile(env$slo)[4]] = 'steep'
slo2[env$slo<=quantile(env$slo)[3]] = 'moderate'
slo2[env$slo<=quantile(env$slo)[2]] = 'low'
slo2 <- factor(slo2,levels=c('low','moderate','steep','very_steep'))
# create an env2 data frame with slope as a qualitative variable
env2<-env
env2$slo<-slo2
# create two subsets of explanatory variables
# Physiography (upstream-downstream gradient)
envtopo<-env[,c(1:3)] # names(envtopo), covariate matrix isolated from env
# water quality 
envchem <- env[,c(4:10)] # names(envchem), env. matrix isolated from env
# Hellinger-transform the species dataset
spe.hel<-decostand(spe,'hellinger') # spe matrix
rda

Same kind of formula as used in lm and other R functions devoted to regression. Check ?rda for details.

spe.rda <- rda(spe.hel~., data=env2) 
summary (spe.rda) # scaling 2 (default)

  • Partitioning of variance: variance of Y explained by the X variables (constrained proportion, 72.71% here), the unexplained variance of Y (unconstrained proportion, 27.29% here). This R2 is biased.

  • Eigenvalues and their contribution: Summarize the eigenvalues, the proportions explained and the cumulative proportion of each canonical axis (each canonical axis = each constraining variable, in this case, the environmental variables from env). Cumulative proportion give the biased R2.

  • Accumulated constrained eigenvalues: these are cumulative amounts of variance expressed as proportions of the total explained variance, as opposed to their contribution to the total variance described before

  • Species scores: coordinates of the tips of the vector representing the response variables in the bi- or tri- plots. Depending on the scaling chosen

  • Site scores: coordinates of the sites as expressed in the space of the response variables Y

  • Site constraints: coordinates of the sites in the space of the explanatory variable X. Fitted site scores

  • Biplot scores: coordinates of the tips of the vectors representing explanatory variables. Correlations based. For factors, representation of the centroids of the levels is preferable.

  • Centroids for factor constraints: coordinates of centroids of levels of factor variables, i.e. means of the scores of the sites possessing state ‘1’ for a given level.

In the rda output, one interesting piece of information is missing: the canonical coefficients, i.e. the equivalent of the regression coefficients of each explanatory variable on each canonical axis. They can be retrieved using coef:

coef(spe.rda)
##                        RDA1          RDA2          RDA3          RDA4
## alt            4.482548e-04  7.559499e-05  5.205825e-04  0.0003883845
## slomoderate   -1.239617e-02 -1.660194e-02  1.600691e-02 -0.0278562534
## slosteep       4.783902e-02  4.893302e-02  1.022578e-01  0.1347997771
## slovery_steep  1.800056e-02 -5.691933e-02  2.322638e-01  0.1002359565
## flo           -1.406377e-05  4.440084e-05  8.929849e-05  0.0001647159
## pH             1.882277e-03 -3.163592e-03 -4.820217e-03  0.0114164750
## har            2.558034e-03 -1.955496e-03 -6.590194e-03 -0.0093556696
## pho            1.031542e-03  4.583584e-04 -1.000153e-03 -0.0010502434
## nit           -1.238247e-04  1.041485e-03  6.253967e-04  0.0007742183
## amm           -1.084412e-03 -4.407786e-03  5.724774e-05  0.0005385427
## oxy            6.866921e-03  1.980446e-03 -8.941533e-03  0.0120088406
## bdo            1.083225e-03 -2.696114e-03 -2.532252e-03  0.0074517578
##                        RDA5          RDA6          RDA7          RDA8
## alt            0.0018572061 -6.313946e-05 -1.355362e-03  0.0011208492
## slomoderate    0.2761288455  1.310695e-01 -2.291843e-02  0.0188300635
## slosteep       0.3938129289 -1.795824e-01  4.631974e-02  0.1236428214
## slovery_steep  0.0368146352 -1.742060e-01  5.172993e-01  0.0675640140
## flo            0.0001331845  2.705757e-05 -2.419805e-05  0.0001039463
## pH             0.0412886847  1.091403e-02  1.398064e-02 -0.0436295510
## har            0.0052287075 -6.098098e-03  2.195518e-03  0.0105362477
## pho            0.0042299123 -3.694132e-03  3.587466e-04 -0.0070104314
## nit            0.0023440122 -3.541252e-04 -2.404285e-03  0.0012840316
## amm           -0.0181246888  4.798631e-05  2.819379e-03  0.0106801348
## oxy            0.0032052566  3.880445e-03 -5.802604e-03  0.0061374900
## bdo            0.0067082880  9.276548e-03 -1.971950e-03  0.0047865971
##                        RDA9         RDA10         RDA11         RDA12
## alt           -2.530083e-04  1.189659e-03  6.826822e-04  0.0009471677
## slomoderate   -3.113354e-01 -2.780063e-01  3.980809e-02 -0.2974896027
## slosteep       9.638200e-02 -4.471000e-01  2.445036e-01 -0.3475535793
## slovery_steep -2.262451e-01 -5.900229e-01  2.457104e-01 -0.1848717482
## flo           -6.430624e-06  4.427139e-05 -2.256503e-05  0.0000648586
## pH            -2.158410e-03 -9.040638e-02  6.960453e-03  0.0575630103
## har           -6.844877e-04  3.353110e-03  7.758175e-04  0.0062181193
## pho           -2.865315e-03  2.459177e-03 -1.554490e-05 -0.0063090083
## nit           -6.863650e-04  1.130900e-03  3.983544e-03  0.0009422466
## amm            3.084216e-03 -1.217501e-02 -1.596411e-02  0.0089790152
## oxy           -1.964441e-03  8.988104e-03  6.274157e-03  0.0025893743
## bdo            3.593651e-03  6.541604e-03  1.113928e-02  0.0040315843

In addition, like the ordinary R2 of a multiple regression, this R2 is biased. It can be corrected by adjusting the R2 using Ezekiel’s formula (Ezekiel 1930). An adjusted R2 near 0 indicates that X does not explain more of the variation of Y than random normal variables would do. It can even be negative, indicating that the explanatory variables X do worse than random normal variables would. The R2 and adjusted R2 can be computed using vegan’s function RsquareAdj

# Retrieval of the adjusted R2
# Unadjusted R2 retrieved from the RDA result
R2<-RsquareAdj(spe.rda)$r.squared
# Adjusted R2 retrieved from the RDA result
R2adj<-RsquareAdj(spe.rda)$adj.r.squared
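Ezekiel’s adjustment is simple enough to verify by hand; a sketch, assuming n = 29 sites and m = 12 degrees of freedom of the constraints (as in spe.rda above):

```r
# Ezekiel (1930): R2adj = 1 - (1 - R2) * (n - 1) / (n - m - 1)
ezekiel <- function(R2, n, m) 1 - (1 - R2) * (n - 1) / (n - m - 1)
ezekiel(0.7271, n = 29, m = 12)   # ~0.5224, matching RsquareAdj(spe.rda)
```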

Plots

We can call this a triplot since three different entities are present in the plot: sites, response variables and explanatory variables. To differentiate the latter two, it is good practice to draw arrowheads only on the vectors of the quantitative explanatory variables, not on the response-variable vectors

# triplot of the rda results
par(mfrow = c(2, 2))

# site scores are weighted by sum of species
# scaling 1: distance triplot
plot (spe.rda, scaling=1, main='Triplot RDA spe.hel ~ env – scaling 1 – wa scores')
spe.sc <- scores (spe.rda, choices=1:2, scaling=1, display='sp') 
arrows (0,0, spe.sc[,1],spe.sc[,2],length=0,lty=1,col='red') 

# scaling 2 (default)
plot (spe.rda,main='Triplot RDA spe.hel ~ env – scaling 2 – wa scores')
spe2.sc <- scores (spe.rda, choices=1:2, display='sp')
arrows (0,0, spe2.sc[,1],spe2.sc[,2],length=0,lty=1,col='red')

# site scores are linear combinations of the environmental variables
# scaling 1
plot (spe.rda,scaling=1,display=c('sp','lc','cn'),main='Triplot RDA spe.hel ~ env – scaling 1 – lc scores')
arrows (0,0, spe.sc[,1],spe.sc[,2],length=0,lty=1,col='red')
# scaling 2
plot (spe.rda,display=c('sp','lc','cn'),main='Triplot RDA spe.hel ~ env – scaling 2 – lc scores') # cn for centroids
arrows (0,0, spe2.sc[,1],spe2.sc[,2],length=0,lty=1,col='red')

Interpretation

  • Interpretation of the two scaling

    • For the species and the sites, the interpretation of the two scalings is the same as in PCA

    • The presence of vectors and centroids of explanatory variables calls for additional interpretation rules

  • Let’s interpret our last two triplots:

    • The first two canonical axes together explain 58.3% of the total variance of the data, the first axis alone 47.6% (unadjusted values)

    • Since R2adj = 0.5224, the percentages of accumulated constrained eigenvalues show that the first axis alone explains 0.5224 x 0.643 = 0.336, or 33.6% of the variance, and the first two together 0.5224 x 0.788 = 0.412, or 41.2%

    • Major trends have been modelled in this analysis! In ecology, data are quite noisy, so you should never expect a very high R2adj value

Overall test

The overall test function is called anova.cca. This name is unfortunate, since it suggests the classical ANOVA test, which this is not!

# Global test of the RDA results
anova.cca(spe.rda,step=1000)
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho + nit + amm + oxy + bdo, data = env2)
##          Df Variance      F Pr(>F)    
## Model    12  0.36537 3.5522  0.001 ***
## Residual 16  0.13714                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test of the axes can only be run with the formula syntax. How many canonical axes are significant?

# Test of all canonical axes
anova.cca(spe.rda,by='axis',step=1000)
## Permutation test for rda under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho + nit + amm + oxy + bdo, data = env2)
##          Df Variance       F Pr(>F)    
## RDA1      1 0.228081 26.6098  0.001 ***
## RDA2      1 0.053696  6.2646  0.006 ** 
## RDA3      1 0.032124  3.7478  0.370    
## RDA4      1 0.023207  2.7075  0.771    
## RDA5      1 0.008707  1.0159  0.999    
## RDA6      1 0.007218  0.8421  1.000    
## RDA7      1 0.004862  0.5673  1.000    
## RDA8      1 0.002919  0.3406  1.000    
## RDA9      1 0.002141  0.2497  1.000    
## RDA10     1 0.001160  0.1353  1.000    
## RDA11     1 0.000913  0.1066  1.000    
## RDA12     1 0.000341  0.0397  1.000    
## Residual 16 0.137141                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Partial rda

  • Partial canonical ordination is the multivariate equivalent of partial linear regression: it is possible, for example, to run an RDA of a (transformed) plant species data matrix Y, explained by a matrix of climatic variables X, in the presence of soil covariables W. This displays the patterns of the species data uniquely explained by a linear model of the climatic variables when the effect of the soil is held constant

  • Let’s play with the two objects containing subsets of environmental variables: the physiographical variables (envtopo), i.e. altitude, slope (the original quantitative one, not the qualitative version) and flow, and the water chemistry variables (envchem), i.e. pH, hardness, phosphates, nitrates, ammonium, oxygen content and biological oxygen demand

Does water chemistry significantly explain the fish species pattern when the effect of the topographical gradient is held constant?

# Partial RDA: effect of water chemistry, holding physiography constant

# simple interface; X and W may be separate tables of quantitative variables

spechem.physio <- rda(spe.hel,envchem,envtopo)
spechem.physio
summary(spechem.physio)

# formula interface; X and W must be in the same data frame
class(env)
spechem.physio2<-rda(spe.hel~pH+har+pho+nit+amm+oxy+bdo+Condition(alt+slo+flo), data=env)
spechem.physio2
summary(spechem.physio2)

# the results of the two analyses are identical
# check in the output: four components in the partitioning variance instead of 3

# Test of the partial RDA (using the results with the formula interface to allow the tests of the axes to be run)

anova.cca(spechem.physio2,step=1000)
anova.cca(spechem.physio2,step=1000,by='axis')

# partial RDA triplots (with fitted site score)
par(mfrow=c(1,2))
#scaling 1
plot(spechem.physio,scaling=1,display=c('sp','lc','cn'), main='Triplot RDA spe.hel ~ chem | topo – scaling 1 – lc scores')
spe3.sc <- scores(spechem.physio,choices=1:2,scaling=1, display='sp')
arrows(0,0,spe3.sc[,1],spe3.sc[,2],length=0,lty=1,col='red')

#scaling 2
plot(spechem.physio,display=c('sp','lc','cn'), main='Triplot RDA spe.hel ~ chem | topo – scaling 2 – lc scores')
spe4.sc <- scores(spechem.physio,choices=1:2,display='sp')
arrows(0,0,spe4.sc[,1],spe4.sc[,2],length=0,lty=1,col='red')

The result of this partial analysis differs somewhat from the previous results but not fundamentally. har and nit are less important in explaining the fish community structure. This may be due to the fact that these two variables are well correlated with the position along the river.

Forward selection of explanatory variables

  • We sometimes wish to reduce the number of explanatory variables. The reasons vary: the search for parsimony, a rich data set but poor a priori hypotheses, or a method producing a large set of explanatory variables that must be reduced afterwards.

  • In the doubs data, there could be two reasons to reduce the number of explanatory variables: the search for parsimony, and possible strong linear dependencies (correlations) among the explanatory variables in the RDA model

  • Linear dependencies can be explored by computing the variables’ variance inflation factors (VIF). VIFs above 20 indicate strong collinearity. Ideally, VIFs above 10 should at least be examined, and avoided if possible. VIFs can be computed in vegan after rda or cca using the function vif.cca

# Variance inflation factor (VIF) in two RDAs 
# First RDA: all environmental variables
vif.cca(spe.rda)
##           alt   slomoderate      slosteep slovery_steep           flo 
##     20.397021      2.085921      2.987679      4.594983      6.684187 
##            pH           har           pho           nit           amm 
##      1.363575      3.049937     30.614913     18.953092     40.114746 
##           oxy           bdo 
##      6.854703     17.980889
# Second RDA: subset of environmental variables
vif.cca (spechem.physio)
##       alt       slo       flo        pH       har       pho       nit       amm 
## 17.475490  4.221651  6.021996  1.471728  2.994588 25.379501 15.630086 30.353271 
##       oxy       bdo 
##  7.623932 18.098637
# A reduction is justified !!!

To select the significant explanatory variables, you can then perform a stepwise selection (forward or backward), using the ordistep or ordiR2step function (or the forward.sel function of the package packfor)

# Forward selection using ordistep() function. This function allows the use of factors. Options are also available for stepwise and backward selection of the explanatory variables
spe.rda.all<-rda(spe.hel~.,data=env)
step.forward1 <- ordistep(rda(spe.hel~1,data=env), scope=formula(spe.rda.all),direction='forward',pstep=1000)
step.forward2 <- ordiR2step(rda(spe.hel~1,data=env), scope=formula(spe.rda.all),direction='forward',pstep=1000)

These analyses show that the most parsimonious attitude would be to settle on a model containing only three variables: altitude, oxygen, and biological oxygen demand

Parsimonious RDA
spe.rda.pars <- rda(spe.hel~alt+oxy+bdo, data=env)
spe.rda.pars
anova.cca(spe.rda.pars,step=1000)
anova.cca(spe.rda.pars,step=1000,by='axis')
vif.cca(spe.rda.pars)
R2a.pars <- RsquareAdj(spe.rda.pars)$adj.r.squared
# Triplots of the parsimonious RDA (with fitted site scores)
# scaling 1

plot(spe.rda.pars,scaling=1,display=c('sp','lc','cn'),main='Triplot RDA spe.hel ~ alt+oxy+bdo – scaling 1 – lc scores')
spe4.sc <- scores(spe.rda.pars, choices=1:2, scaling=1,display='sp')
arrows(0,0,spe4.sc[,1],spe4.sc[,2],length=0,lty=1, col='red')

# scaling 2

plot(spe.rda.pars, display=c('sp','lc','cn'),main='Triplot RDA spe.hel ~ alt+oxy+bdo – scaling 2 – lc scores')
spe5.sc <- scores(spe.rda.pars, choices=1:2,display='sp')
arrows(0,0,spe5.sc[,1],spe5.sc[,2],length=0,lty=1, col='red')

# since there is now a third significant canonical axis, you could compute other combinations: axes 1 and 3, axes 2 and 3
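For instance, the choices argument of the plot method selects the axes to display (a sketch for axes 1 and 3, scaling 2):

```r
# triplot of canonical axes 1 and 3 of the parsimonious RDA
plot(spe.rda.pars, choices = c(1, 3), display = c('sp', 'lc', 'cn'),
     main = 'Triplot RDA spe.hel ~ alt+oxy+bdo - axes 1 and 3')
```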

Variation partitioning

  • Variation partitioning is a type of analysis that combines RDA and partial RDA to divide the variation of a response data set among two, three or four explanatory data sets.

  • Variation partitioning is generally represented by a Venn diagram in which the percentage of variance explained by each explanatory data set (or combination of data sets) is reported.

# Variation partitioning with two sets of explanatory variables
# and all explanatory variables
showvarparts(2) # shows the fraction labels for two explanatory matrices

spe.part.all <- varpart(spe.hel,envchem,envtopo)
spe.part.all
## 
## Partition of variance in RDA 
## 
## Call: varpart(Y = spe.hel, X = envchem, envtopo)
## 
## Explanatory tables:
## X1:  envchem
## X2:  envtopo 
## 
## No. of explanatory tables: 2 
## Total variation (SS): 14.07 
##             Variance: 0.50251 
## No. of observations: 29 
## 
## Partition table:
##                      Df R.squared Adj.R.squared Testable
## [a+b] = X1            7   0.60579       0.47439     TRUE
## [b+c] = X2            3   0.39888       0.32674     TRUE
## [a+b+c] = X1+X2      10   0.71441       0.55575     TRUE
## Individual fractions                                    
## [a] = X1|X2           7                 0.22901     TRUE
## [b]                   0                 0.24538    FALSE
## [c] = X2|X1           3                 0.08136     TRUE
## [d] = Residuals                         0.44425    FALSE
## ---
## Use function 'rda' to test significance of fractions of interest
plot(spe.part.all, digits=2)

The first partitioning shows that both sets of explanatory variables contribute to the explanation of the species data. The unique contribution of the chemical variables (fraction [a], R2adj = 0.229) is much larger than that of the physiography (fraction [c], R2adj = 0.081). The variation explained jointly by the two sets (fraction [b], R2adj = 0.245) is also large. This indicates that the chemical and physiographic variables are intercorrelated => a good reason for parsimony

db-RDA

  • Distance-based Redundancy analysis

Community composition data must be transformed before they are used: chord, Hellinger, etc. Ecologists may, however, want to compute an RDA based on other dissimilarity measures that cannot be obtained by a data transformation followed by the calculation of the Euclidean distance: this is db-RDA

  • Function capscale from vegan
# distance-based redundancy analysis
spe.bray <- vegdist (spe,'bray')
#response matrix can be raw data
dbrda1<-capscale(spe~pH+har+pho+nit+amm+oxy+bdo+ Condition(alt+slo+flo), distance='bray', data=env,add=T) 
# response matrix can be a dissimilarity matrix
dbrda2<-capscale(spe.bray~pH+har+pho+nit+amm+oxy+bdo+ Condition(alt+slo+flo), distance='bray', data=env,add=T) 
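As with the RDA earlier, the significance of the db-RDA can be assessed by permutation (a sketch on the dbrda2 object just created):

```r
# permutation test of the partial db-RDA fitted on Bray-Curtis distances
anova.cca(dbrda2, step = 1000)
```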

CCA

Pending